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Abstract 

Light and Widely Applicable (LWA-) MCMC is a novel approximation of the Metropolis- 
Hastings kernel targeting a posterior distribution defined on a large number of observa¬ 
tions. Inspired by Approximate Bayesian Computation, we design a Markov chain whose 
transition makes use of an unknown but fixed, fraction of the available data, where the 
random choice of sub-sample is guided by the fidelity of this sub-sample to the observed 
data, as measured by summary (or sufficient) statistics. LWA-MCMC is a generic and 
flexible approach, as illustrated by the diverse set of examples which we explore. In each 
case LWA-MCMC yields excellent performance and in some cases a dramatic improve¬ 
ment compared to existing methodologies. 
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1. Introduction 

The development of statistical methodology which scales to large datasets represents a 
significant research frontier in modern statistics. This paper presents a generic and flexible 
approach to directly address this challenge. Given a set of observed data (V,..., Y N ), a 
specified prior distribution p and a likelihood function /, estimating parameters 6 G 0 of 
the model proceeds via exploration of the posterior distribution n defined on (0,^(0)) 
by 

7T(d 0 I Y u ..., Y n ) oc f(Y h ...,Y n I 0)p(d0 ). (1) 

Stochastic computation methods such as Monte Carlo methods allow one to estimate 
characteristics of it. In Bayesian inference, Markov chain Monte Carlo (MCMC) methods 
remain the most widely used strategy, Paradoxically, improvements in data acquisition 
technologies together with increased storage capacities, present a new challenge for these 
methods. Indeed, the size of the data set N (along with the dimension of each observation) 
can become so large, that even a routine likelihood evaluation is made prohibitively 
computationally intensive. As a consequence, methods such as the Metropolis-Hastings 
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sampler cannot be realistically considered. This issue has recently generated a lot of 
research activity. 

While some authors have designed exact algorithms that tend to match the theoretical 
requirements of usual MCMC methods, others have considered the possibility of approxi¬ 
mate, or noisy, methods while still trying to derive some quantitative error bound on the 
resulting approximation scheme. In this context, we refer to as exact any method that 
produces -possibly dependent- samples from n, as opposed to noisy, which samples from 


an approximation of 7r. In Pseudo-marginal algorithms (Andrieu and Roberts, 2009), the 


evaluation of the likelihood function is substituted by an unbiased and positive estima¬ 
tor, while still preserving the stationary distribution 7r and thus being exact. Although 


theoretically appealing (Andrieu and Vihola, 2015), finding an unbiased and positive es 


timator of the likelihood turns out to be a challenging problem in itself. Two other more 


recent exact approaches (Banterle et al., 2014, Maclaurin and Adams, 2014) overcome 


the need for an unbiased and positive estimator of the likelihood. However, the method 


proposed in Maclaurin and Adams (2014) requires the specification of a lower bound of 
the likelihood - which is a strong assumption especially for high-dimensional problems. 
A poor lower bound function yields a method whose computational complexity can even 


exceed that of a M-H sampler. In Banterle et al. (2014), the authors suggest to break 
the usual M-H ratio into N independent decision steps (each corresponding to a factor 
involving the likelihood of a datum) such that a proposal is globally rejected as soon as it 
is rejected by an elementary step. This sticky version of the M-H sampler is nevertheless 
shown to be 7T-stationary but yields an higher asymptotic variance by a straightforward 


Peskun comparison argument with the standard M-H kernel (Tierney, 1998) 


An alternative solution consists of finding an approximation of the M-H transition 
kernel, that would emulate the outcome of the accept/reject step without having to 
compute the likelihood ratio in the M-H transition kernel. In Bayesian settings where 
the observed data are independent and identically distributed (i.i.d.), the likelihood is 
expressed as 


N 


f(Y u ...,Y N \9)<xl[f(Y k \e) 


k= 1 


The issue of the reliability of making a decision to accept or reject a move based only a 


subset of these factors has been recently addressed (Korattikara et ah, 2014, Bardenet 


et ah, 2014). In these papers, the usual M-H acceptance decision, which can be rewritten 


as 


^ log 
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p{0)q{9,9') 


p(9')q(9',9) J ~ N 
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( 2 ) 


is made using a Monte Carlo approximation 


Pn = r^ lo STwM4v’ (u 1 ,..., u n )e{l,..., N } n , { Ui ± Uj ) 


n 


k= 1 


f(Y Uk | 9) ’ 


of the right hand side of ([2]). Both methods proposed in Korattikara et al. (2014) and 
Bardenet et al. (2014) share the same principle that 


(a) a decision can be made with a certain level of confidence as soon as p n becomes 
sufficiently far from the left-hand side of (|2]) , 
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(b) and if this condition is not reached, n is increased. 


They nevertheless differ by the way the level of confidence is derived and by the theoretical 
arguments motivating the approximation. However, practically, as observed in some 
examples presented in Korattikara et al. (2014) and Bardenet et al. (2014), both methods 
tend to draw a significant portion of the data (i. e n —> N), in order to reach the confidence 
interval when the Markov chain gets closer to stationarity. Finally, note that noisy 
algorithms may retain some theoretical appealing feature. This was developed in Alquier 


et al. (2014), where the authors show that using an approximation of an unavailable exact 


transition kernel can still provide ergodic Markov chains. In particular, this yields a class 


of general noisy M-H algorithms, extending the pseudo-marginal approach (Andrieu and 


Roberts, 2009), while relaxing the unbiased estimator assumption. 


In this paper, we propose Light and Widely Applicable MCMC (LWA-MCMC), a novel 
methodology which aims to make the best use of a computational resource available for 
a given computational run-time, while still preserving the celebrated simplicity of the 
standard M-H sampler. Our approach designs a Markov chain on an extended state space 
whose marginal in 6 targets an approximation of n. As a result, our algorithm can be 
cast as a noisy MCMC method. At each transition of the Markov chain, a new candidate 
is proposed and accepted/rejected through a probability that only uses a fraction, n/N, 
of the available data which is by construction -and contrary to Maclaurin and Adamsj 
(2014), Korattikara et al. (2014) and Bardenet et al. (2014)- held constant throughout 


the algorithm. Moreover, unlike most of the papers mentioned before, LWA-MCMC can 
be applied to virtually any model (involving i.i.d. data or not), as it does not require any 
assumption on the likelihood function nor on the prior distribution. 

The original target 7r is extended to model a joint distribution between the parameter 
of interest 9 e 0 and an auxiliary TV-dimensional boolean vector identifying the data 
involved in the subset.Each possible subset of data of size n is weighted according to 
a similarity measure with respect to the full set of data, in the spirit of the Approxi¬ 
mate Bayesian Computation (ABC) (see e.g. Marin et ah, 2012). In the special case of 


i.i.d. realizations from an exponential model, we prove that when the similarity measure 
is identical the sufficient statistics, this yields an optimal approximation, in the sense of 
minimizing an upper bound of the Kullback-Leibler (KL) divergence between n and the 
marginal target of our method. 

Our main finding is two-fold: 


• for a fixed computational budget, our method can achieve a better bias/variance 
tradeoff compared to a standard Metropolis-Hastings algorithm and the method 
developed in Bardenet et al. (2014) 

• we observe in different scenarios the existence of a tradeoff between the quality and 
the size n of the batch of the sub-sampled of data, highlighting an LWA-MCMC 
optimal setting. 

We start in Section [2] with a striking real data example which we hope will help the 
reader to understand the problem we address and motivate the solution we propose, with¬ 
out going into further technical details at this stage. In Section [3j we provide theoretical 
results concerning exponential-family models, which we illustrate through a probit exam¬ 
ple. This section allows us to justify our motivations supporting the LWA-MCMC general 
methodology developed in Section [4j we define the transition kernel on the extended state 
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space and show that it yields a Markov chain targeting, marginally, an approximation 
of 7i. Finally, in Section [5j onr method is used to estimate parameters of a time-series 
model and to perform a binary classification task. In the latter example, we compare 


the performance of our algorithm with the SubLikelihood approach proposed in Bardenet 


et ah (2014). 


2. An introductory example 

We address the problem of estimating template shapes of handwritten digits from the 
MNIST database (http://yann.lecun.com/exdb/mnist/) by inferring a partially known 


deformable template model (Allassonniere et ah, 2007). Here, a 16 x 16 matrix represents 


a digit whose conditional distribution given its class (0,1,..., 9) corresponds to a random 
deformation of the template shape, parameterized by a d = 256 dimensional vector 9 , of 
the known digit. Assuming small deformations, we can rewrite the model as a standard 
regression problem: 

given 4 = i, Y k = 0(4) + cr 2 e k (3) 

where 4 £ {0, - - -, 9} is the class of observation Y k (regarded as a vector M 225 ), 0 : 
M 256 -»■ M 225 is some deterministic linear mapping, cr > 0 is a variance parameter and 
Cj r ' s “' A/" (0 2 25) M 225 ) is additive noise. Given a set of N = 10,000 labeled images and 
defining a prior distribution for 6 = {9±, ..., 0g}, one can estimate 6 through its posterior 
distribution, for example using a standard Metropolis-Hastings algorithm. However, for 
two main reasons, the efficiency of such an approach can be questioned: (i) computing 
the N likelihoods in the M-H ratio dramatically slows down each transition and (ii) the 
highly peaked posterior distribution hinders a quick exploration of the state space. 

Based on these observations, our approach aims at working with different subsets of 
data which addresses issues (i) and (ii). At this stage, we do not provide precise details 
of the LWA-MCMC machinery nor its accuracy but we rather provide an insight of the 
rationale of this approach: we design a Markov chain whose transition kernel targets the 
posterior distribution of the parameter of interest 9 given a random subset of n data 
(n <C N). More specifically, we inject in the standard M-H transition a decision about 
refreshing the subset of data, which, as a result, will change randomly over time. In this 
example, we use the knowledge of the observation labels to promote subsets in which the 
proportion of each digit is balanced. 

We considered only five digits, 1,..., 5 (for illustration purposes), subsets of size n = 
100 and a non-informative Gaussian prior for 9, as specified in Allassonniere et ah (2007). 


Figure [l] indicates a striking advantage of our method compared to a standard M-H using 
the N = 10, 000 data. In this scenario, we allow a given computational budget (1 hour) for 
both methods and compare the estimation of the mean estimate of the two Markov chains. 
Qualitatively, the upper part of Figure [l] compares the estimated template shapes of the 
five digits at different time steps and shows that our method allows to extract template 
shapes much quicker than the standard M-H, while still reaching an apparent similar 
graphical quality asymptotically (after one hour). This fact is conhrmed quantitatively, in 
the lower part of Figure [lj which plots, against time and for both methods, the Euclidean 
distance between the Markov chain mean estimate and the Maximum Likelihood Estimate 
(91,... ,91) obtained using a stochastic EM (Allassonniere et ah, 2007). More precisely, 
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we compare the real valued function {d(t), t G M} defined as 
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d(t) = ll 0 i where 

i =1 


for the two Markov chains. 


Vi G 


n(t) = max fceN {i > r fc } , 


Tk is the time at the end of the k -th iteration , 
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Figure 1: Efficiency of template estimation through M-H and LWA-MCMC. 

LWA-MCMC provides very encouraging results for this real-data example. We for¬ 
malize the method and sketch theoretical arguments in the next two sections. 


3. Approximation of the posterior distribution in exponential models: an 
optimality result 

In this section, we consider the case of N independent and identically distributed 
( i.i.d .) observations from an exponential model. Taking the posterior distribution given 
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all the N available data as the target distribution, which we call the full posterior, we con¬ 
sider the posterior distribution of the parameter of interest given only a subset of the N 
observations as an approximation of the full-posterior, which we call a sub-posterior. We 
investigate the influence of the choice of a subset of n data on this approximation. Propo¬ 
sition [l] shows the existence of an optimal set of possible subsets of size n with respect to 
the Kullback-Leibler (KL) divergence between the full posterior and the sub-posterior. 
This result will be used in the next section to design and justify the LWA-MCMC method¬ 
ology, extending this approach to non-hid. observations from general likelihood models. 

3.1. Notation 

Let (Yi,..., Y/v) G Y N be a set of i.i.d. observed data (Y C M m , m > 0) and define 

. Ui = (u,...,u) ifi < i < j < N with the convention that = {0}, otherwise. 

• Y. k = (Y x ,..., Y fe _!, Y k+1 ,..., Y n ) for all k G {1,..., N}. 

• Y v = {Y fc , k G U}, where U C {1,..., N}. 

In this section, we assume that the likelihood model / belongs to the exponential 
family and is fully specified by a vector of parameters 6 G 0, (0 C M. d , d > 0) and a 
sufficient statistic mapping S : Y —> S (S C M s , s > 0) such that 

f(y I °) = ex P (g(0),s(y)) / l{9) 


is the density of the likelihood distribution with respect to the Lebesgue measure A(dy). 
Here, the symbol (•, •) denotes the canonical inner product in S, g : © —> S is a model- 
specific mapping and L{9) is the likelihood normalizing constant. 

The posterior distribution n is defined on the measurable space (©,$) by its density 
function 


N / r N 

n(0 1 Y 1: „) = m J] f(Y t \8) / p(d«) J] f(Y t | 0), 

k=1 ' ^ k=1 


( 4 ) 


with respect to the Lebesgue measure on (0, d). p is a prior distribution defined on (0, d) 
and with some abuse of notation, p denotes also the probability density function (pdf) 
on 0 accordingly (p(dd) = p(9)X(d9)). 

For all n < N, we define U Tl as the set of the possible combinations of n different 
integer numbers less than or equal to N and Y/ n = 2 Un as the powerset of U n . Finally, let 
U = U„<» u„ and YX = 2 U . In the sequel, we set n as a constant and wish to compare the 
full-posterior distribution Q with any of the sub-posterior distributions from the family 
F n = {7f (■ | Y 1:N , Un), U n G U„}, where for all U n G U n , we have dehned 


7 t{9 


Yu n, u n ) = 1 YuJ = p(0) f(Y t |9)//' p(d9) 

k&Un ' ^ keU n 


f(Xk | 9). 


( 5 ) 
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3.2. Optimal subsets for the Kullback-Leibler divergence between n and n 

Recall that for two measures 7r and tt defined on the same measurable space (©,79), 
the Kullback-Leibler (KL) divergence between 7r and 7r is dehned as: 

KL(7r||7f)=E 7r |log^||^^| . (6) 

Although not a proper distance between probability measures dehned on the same mea¬ 
surable space, KL (7r||7r) is used as a similarity criterion between 7r and 7r. It can be 
interpreted in information theory as a measure of the information lost when 7r is used to 
approximate 7r, which is our primary concern here. We now state the main result of this 
section. 


Proposition 1. Consider the KL divergence between tt = 7r( • | Y\-n) and ttu — 7r( ■ | Yu) G 
F n? we have: 

(*) 

KL (tt||7 tu) < T(n, N, Y 1:N ) + B(U , n, Y 1:N ) (7) 

where T is a deterministic function independent of U and B is a positive function. 

(ii) If the set of optimal subsets U* C U n defined by 

u;.:=|c/eu„, i£s(n) = l£s(>i)} (8) 

l V fc=l n k£U ) 

is non-empty, then for any U G U* , B(U,n,Yi x jf) = 0, hence yielding an optimal 
upper- bound. 

(Hi) For a given subset Uq G U n such that for all U G U ra 


1 

N 


N 

£ S(Y k ) 

k= 1 


n 




k£U 0 


< 



n 


£ S(Y t ) 


k£U 


then we have 

B(U 0 , n, Y 1:N ) < B(U, n, Yi:jv) • 


( 9 ) 


( 10 ) 


In other words, the sub-posterior distributions in F n with subsets having the same suf- 
hcient statistics on average as for the full dataset, will achieve an optimal approximation 
(with respect to upper-bounding the KL divergence). Moreover (J9]) defines an order on 
F„ which implies an order on their relative KL divergence upper-bound (10). The proof 
is outlined in Appendix Appendix A 


3.3. Illustration with a probit model: effect of choice of sub-sample 

We consider a pedagogical example, based on a probit model, to illustrate Proposition 
[l| A probit model is used in regression problems in which a binary variable Y\ G {0,1} 
is observed through the following sequence of independent random experiments, dehned 
for all k G {1,..., N} as: 

(i) Draw X & ~ , y 2 ) 
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r(U) 

KL (ttWttu) /KL ( 71 -1|7ft/.) 

0.01 

1.13 

0.04 

1.39 

0.07 

1.87 

0.1 

2.76 


Table 1: Comparison of the KL divergence between the full-posterior and the optimal sub-posterior 
(r(U) = 0) with the KL distance between the full-posterior and other sub-posterior distribution. 


(ii) Set Y k as follows 


f 1, if X k > 0, 

1 0, otherwise. 


( 11 ) 


Observing a large number of realizations Yi,...,Y/v, we aim to estimate the posterior 
distribution of 9. In practice, one also estimates 7 but for illustration purpose, this 
parameter is considered as known here. The likelihood function can be expressed as 


f(Y k \6)=a(e) Yk (l-a(9)f- Y ^ 


a -<*m 


a{9) 

1 — a (9) 


Y k 


J 


( 12 ) 


where a(9) = f^°(2n r j 2 ) 1//2 exp{(l/ 27 2 )(t—#*) 2 }df and clearly belongs to the exponential 
family. The full-posterior distribution can be written as 


tt(9 I Y 1:N ) oc p{9) (1 


a(9)f 


a(9) 

1 — a(9) 


v-'lV 
2 ^ k =1 


Yk 


1 


where p is the prior density, which we assume to be non informative (p(9) = A/"(a, 6 2 )). 
In this example, the mapping 9 —» n(9 \ Y\ : jv) can be easily estimated for any 9 G 0, 
even when N is extremely large, as it only requires to sum over all the binary variables 
Yi ,..., Yn- As a consequence, samples from the full-posterior distribution 7 r( • | Ypjv) can 
be routinely obtained by a standard M-H algorithm and similarly for any sub-posterior 
distributions 7 r( • | Yjy) 6 F n . 

We present in Figure [2] some inference results for the full-posterior and several sub¬ 
posterior distributions (with different n and different values of sufficient statistics) ob¬ 
tained with parameters (9*, 7 ) = (1,1) and N = 10,000 simulated data. In the upper 
plot, we hold n = 100 constant and compare several sub-posterior distributions given 
subsets of data U G U 100 having different matches with the full data sufficient statistics 
r(U) = \S(U) - S N I, where S(U) = S{Y k ) and S N = N~' S(Y k ). In 

this probit model, S is simply the identity function, implying that r(U) monitories the 
different proportion of 1 and 0’s between the full dataset and in the subset U. This 
plot, as well as the quantitative result of Table [l] providing the KL divergence between 
the full-posterior and these different sub-posterior distributions are consistent with the 
statement of Proposition [lj when learning from a subset of n data, one should work with 
a subset U featuring a perfect match with the full dataset, i.e r(U ) = 0, to achieve an 
optimal approximation of 7 r. Note that the sub-posterior distributions only vary through 
their sufficient statistics, so obviously, in exponential models, the choice of the optimal 

subset is not unique, as U* is not restricted to one element. 
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Figure 2: Influence of the parameter U £ Uioo on the sub-posterior distribution 7r( • | Yjj) and comparison 
with 7 r (top) - Influence of the subsets size n on the sub-posterior distribution n( ■ \ Yu), U £ U* (bottom). 

The lower plot of Figure [2] compares the influence of n on the optimal sub-posterior 
distributions 7r( • | Y v ), [/ 6 U* for n E {50,100,1000,5000}. As expected, while the 
variance becomes wider as n decreases, the expectation remains relatively constant. 

Remark 1. The results of the lower plot of Figure\^may in some ways be related with the 


the "true" parameter is an interior point of the parameter space. More precisely, the 
mean of the Gaussian distribution corresponds to the maximum likelihood estimate of the 
observed data, while the covariance is given by H~ l /n, where H is the Hessian matrix at 
the "true " parameter value. Even though for low values of n, such an asymptotic residt 
does not hold, it is nevertheless consistent with observing in Figure that the variance 
varies (in function of n) much more than the mean of those distributions. 

Remark 2. At this stage, one might wonder why so much effort has been put to overcome 
the problem of sampling from a posterior distribution given a huge amount of data, while 
using a local asymptotic normality theorem would virtually allow to solve this problem 
by sampling from a Gaussian distribution. Two main arguments actually prevent one to 
make use of such a result: 


local asymptotic normality (lan) theorem: Bernstein von Mises theorem (see e.g. Van der 


Vaart, 2000) states that under some mild assumptions about the likelihood and the prior 


distribution, the posterior distribution is asymptotically Gaussian (in n) in the case where 
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• estimation of the coefficients of the asymptotic Gaussian distribution is non-trivial, 
as one needs to invert the Hessian matrix at 9*, which is typically unknown; 

• local asymptotic normality theorems only hold under restrictive assumptions, for 
example, when the observations are i.i.d. 

On the basis of our analysis conducted in the case of i.i.d. realizations from an 
exponential-family model, both at theoretical and experimental levels, it seems reason¬ 
able to consider estimating the Maximum A Posteriori parameter based on a subset of 
data. At first glance, when one aims to estimate 9 using a Markov chain targeting a sub¬ 
posterior, even optimal, 7r( • Yu), {U G U*) instead of the full-posterior may lead to a 
worse efficiency, iteration wise. However, assuming that the computational complexity of 
a Markov chain targeting the full posterior is prohibitively intensive, one may consider a 
Markov chain targeting an optimal sub-posterior as a realistic alternative: more Markov 
chain iterations would be required but at a known and affordable computational cost. 
This yields a trade on the subset size n which will allow lighter transitions at the price 
of a loss in variance. 

4. Light and Widely Applicable MCMC: the general methodology 

In this section, we do not assume any particular correlation pattern for the sequence of 
observations, nor any specific likelihood model and simply write the posterior distribution 
7r as 

vr(d 9 | Yi,..., Y n ) (x p(d9)f(Y 1:N \ 9) . (13) 

The Light and Widely Applicable MCMC (LWA-MCMC) methodology that we de¬ 
scribe now can be regarded as a generalization of the sub-posterior inference detailed in 
the previous section to non-exponential-family models with possibly dependent observa¬ 
tions. 

4-1. Motivation of our approach 

Here, we do not assume the existence of a sufficient statistic mapping for the model 
under consideration. Thus, in order to allow comparison between different subsets of 
data, we introduce an artificial summary statistic mapping S n : Y n — > S (n < N), where 
S C K s . The choice of the summary statistics S n is problem specific and is meant to 
be the counterpart of the sufficient statistic mapping for general models (hence sharing, 
slightly abusively, the same notation). Intuitively, a good choice of S n would capture 
most of the statistical features of Yu (U G U„). In an attempt to derive a similar 
analysis to the exponential model case (Section [3]) and to reach an optimal setup in 
line with Proposition [lj we want to focus inference on those subsets whose summary 
statistics vector is close to that of the full dataset. In this context a subset Y v (U G U n ) 
is said to be more representative of the full dataset than another Yu> ( U' G U„), if 
Pn(Y[/) - Sn(Y\ : n )|| < || S n (Yu') - Sjv(li:Ar)||, where we have set S n (Y L n) = S n (Y w )/n 
as the normalized summary statistics and || • || as the Euclidean distance on S. Since the 
question of specifying summary statistics also arises in ABC, one can take advantage of 
the abundant ABC literature on this topic to find some examples of summary statistics 


for usual likelihood models (see e.g. Nunes and Balding, 2010, 

Csillery et ah, 2010 

Marin 

et al., 2012, 

Fearnhead and Prangle, 2012 

)• 
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We formalise this idea by assigning a weight to any possible subset of data U E U n . 
Let for any n < N and e > 0, v n , e be the distribution defined on the discrete state space 
(U n , %i) whose density with respect to the counting measure is: 


VUE U„, VnAU)=$ 


II S n (Yu ) — sn\ 


E* 


U'SUr, 


| S n (Yu, 


Sn\ 


(14) 


Here $ : M —>• M + is a kernel function, Sn = iSjvO'W) E S and the dependence of u n>€ (U) 
on Y 1:N is implicit. The parameter e allows to control the influence of the representa¬ 
tiveness of a subset U E U n on its overall weight u n ^{U)\ if e 1, the weights tend 
to be uniform, whereas if e <C 1 the weights tend to highlight the representativeness of 
the subsets. As a consequence, e is a tuning parameter whose impact on the inference is 
significant, as we will see in Section [5} The kernel $ allows to smooth the weights and of¬ 
fers protection against the possibly unbounded weights derived from Euclidean distances 
(situations which typically arise when weights are proportional to ||5 n (i7) — •Sjv||~ 1 and 
U* (J8| is not the empty set). In this paper, we let $ be the Gaussian kernel. 


Remark 3. Note that because the statistics used to assess the representativeness of a sub¬ 
set w.r.t. the full dataset are only summary and not sufficient, two subsets (U,U') E U); 
such that i' n ,e{U) = v n ,e(U') might yield two different sub-posteriors. As a consequence, 
should a unique optimal subset U* = argmaXf/ eUri v n%e (U) be available, inferring the full- 
posterior through this corresponding optimal sub-posterior is likely to provide an unbal¬ 
anced learning as most of the data would be simply ignored. Alternatively, when learning 
from a set of good subsets, where goodness is measured by one can expect that each 
sub-posterior involved in the process will act complementarily to improve the approxima¬ 
tion of 7r. 


At this stage, two main questions need to be addressed: 


how can a set of good subsets be determined? Indeed, the dimension of U n , |U n | = 
(^) is prohibitively too large to estimate the normalizing constant in (14) and 
therefore we reasonably assume in the following that the set of weights {u n}e (U), U E 
U„} is unknown. 


(ii) how to design an inference scheme that would make use of these subsets, without 
resorting to a set of independent Markov chains each targeting through a stan¬ 
dard M-H transition kernel a good sub-posterior, which could become even more 
demanding than a standard M-H targeting 7r? 


Because we assume that {u n>e (U), U E U n } is unknown, we consider that the subsets 
Ui, U' 2 , ■ ■ ■ (Ui E U n ) involved in the learning setup are latent variables of the model. We 
define the data-augmented distribution n Uje for any A Ed and U E U n by 

7fn, e (A, U | Y 1:N ) = v n , e (U)n(A | Y v ) , (15) 

where 

7T(de I Yu) = p(9)f(Yu 1 0)A(d0) / J Pie 1 ) f (Yu I e , )x(de l ) 
is a sub-posterior distribution. 
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With some abuse of notation, we write 7r ni£ also for the density of the data-augmented 
distribution with respect to the product measure A(d6 ) )l Un (17). Integrating out the subset 
variable yields the marginal of interest whose density w.r.t. A is expressed as: 


KA e I Y = J2 "nAUMo I Yu). (16) 

C/GU„ 


In this way, the marginal distribution 7 r* e defines an approximation of n. Straightfor¬ 
wardly, as n —>• N, our approximation 7 f* e converges to 7 r. 


Remark 4. The parameters n and e of the joint distribution ( 9 , U), have complementary 
roles in the approximation of n. While n <C N lightens the full-posterior distribution, 
e allows to make up for the approximation by attaching bigger weights to representative 
subsets in the mixture distribution (15). Setting e implies a tradeoff between: 


• e> 1 and v n , e is a flat distribution on U and all subsets have the same weight in 
the mixture 


• e<l and n n>e has most of its probability mass on Q x U* 

In particular, the latter scenario yields an approximation of the full-posterior based only 
on one or a few subsets. Depending on the relevance of the summary statistics, the choice 
eCl may provide a disastrous setup. The simulations detailed in Section [5| clearly show 
the existence of an optimal e for a given n and which also depends on the likelihood model 
considered. 


4.2. Formal description of the LWA-MCMC algorithm 

In this section, we specify LWA-MCMC which allows to sample from an approxima¬ 
tion of 7T. This relies on a Markov chain whose transition kernel achieves our main target, 
namely having a bounded computational complexity, which can be controlled through 
the parameter n. 

4.2.1. LWA-MCMC 

LWA-MCMC makes use of a Markov transition kernel operating on an extended state 
space (0 x 11^,7? 0 %i). Algorithm [l] depicts a Markov transition of the LWA-MCMC 
algorithm. This embeds two successive decisions: the first one allows to refresh the subset 
variable U while the second one updates the parameter 9. R(U, •) and Q[6 , •) are proposal 
kernels respectively on U x % and 0 x 38(Q). 



Figure 3: Intertwined structure of the LWA-MCMC Markov chain. 


The dependence structure within the sequence of variables {9^, k G N} and {L4, k G 
Lf} produced by LWA-MCMC is displayed in Figure pj This yields an intertwined design 
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Algorithm 1 LWA-MCMC transition (0 k , U k ) —* (<4+i, ^4+i) 

1: Input: current state (6k, C4) and summary statistics of current subset S k = 

s„(r n ) 

2: propose to refresh the subset U' ~ R(U k , •) 

3: compute the summary statistics S' = S n (Yjj') 

4: set U k+ i = U' with probability 


Pn,e(U k ,U') 


R(U',U k )*(-\\S'-s N \\M 

RiU^U’^i-WSk-SKW/e)’ 


5: leave U k+ \ = U k otherwise 
6: if the subset has not been refreshed then 
7: propose a change of parameter O' Q(0 k ; •) 

8: set 6 k+ 1 = O' with probability 


a(0 k ,0'\U k ) 




(17) 


(18) 


9: leave 0 k+ 1 = 0 k otherwise 

10: else 

11: set 0 k+ 1 ~ Tu (6 k \ •) where T Uk+i (0 k ] •) is the M-H transition kernel correspond¬ 

ing to steps 7, 8 and 9 iterated L times (L > 1) 

12: end if 

13: return: (0 k+ i,U k+ i) and S' if U k +i ^ U k and S k otherwise 


in which {U k , k G N} is itself a Markov chain targeting v n , e , such that U k is independent 
from the past sequence of parameters (0\, O 2 , ■ ■ ■, 0 k ). This is an important feature of our 
proposal as allowing U k+ \ to depend on 0 k could potentially lead to a local optimisation 
of the parameter 0 with respect to a given subset Yu- Indeed, a parameter 0 k may happen 
to fit particularly well the subset of data Yu k and therefore the U component of the chain 
may get stuck at U k for a large number of iterations. We claim that LWA-MCMC allows 
to move freely throughout U„ at a limited computational cost, a fact which is empirically 
confirmed by the simulations. Moreover, the two distinctive accept/reject steps for U 
and 0 allows the parameter 0 not to be hampered by a lack of move in the U direction. 

To summarize, the LWA-MCMC sampling scheme is appealing as it allows: 

(i) to update the subset U k and the parameter 0 k through two different decisions, 

(ii) to make the subset update independent of 0 k , 

(iii) to control the computational complexity of a transition, 

(iv) to be applied in any inference problem where the likelihood function is tractable. 

The following remark shows that standard Markov chain methods cannot be used to 
sample from 7f n)€ , hence justifying the sampling machinery of LWA-MCMC from another 
perspective. 

Remark 5. A block update M-H or a Metropolis-within-Gibbs algorithm targeting n nj£ 

cannot be implemented, as both involve at some stage an intractable acceptance ratio (the 
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normalizing constant of the sub-posteriors does not cancel anymore because of the different 
subsets of data involved). 


4.2.2. Stability of LWA-MCMC 

Let us assume that (9 k , 14) ~ 7r nj£ . First, note that marginally 

(9k, Uk ) ~ ?Tn,e = ^" Uk ~ Ci,e ^ Uk- 1-1 ~ U n,e , (19) 

where the later implication holds as {Uk, k G N} is in itself a M-H Markov chain with 
target distribution v n . f (See Figure [3] and Algorithm [I] steps 2-5). On the one hand, 
given the event {no subset refresh}, (9k+ 1 , Uk+ 1 ) ~ Tt n ,e simply derives from the fact that 
9k +1 | Uk +1 = Uk ~ 7r( - | Yjj k ) which holds since steps 7-9 in Algorithm [I] consist in a stan¬ 
dard Metropolis-within-Gibbs update for 6. On the other hand, the event {subset refresh} 
disturbs the stationarity: (9k+i, Uk+i) 4 7T n ,e- Indeed, in order to remain invariant, the 
transition kernel should produce a sample 6 k +i such that Ok+i \ Uk +i ~ 7r( • | Uk+i). How¬ 
ever, this is not achievable in a single Markov transition since the target distribution 
7r( • | Uk) becomes instantaneously 7r( • | Uk+i )• Instead, the sample 9k can be regarded as 
the initial state of a Markov chain having 7r( • | Uk+i) as stationary distribution. Tak¬ 
ing L 1 provides a state 9 k +i which is approximately distributed under 7r( • | U k+ 1 ). 
Therefore, the LWA-MCMC transition kernel is not, stricto-sensu, stationary with re¬ 
spect to 7 f n ,e, as when the subset is refreshed, 9 k+ 1 is a sample from the full conditional 
asymptotically in L. At this point we make the following remarks. 

• Since the sequence of distributions {7r(- | Yjj k ), fceN} are likely to be close to each 
other, one can use a very limited number of intermediate steps, typically L — 1 
was used in all the simulations of Section |5j Indeed, depending on the number 
of data refreshed by R, two consecutive subsets Y Uk and Y Uk+1 may only differ 
through very few observations, hence advocating setting L = 1 does not adversely 
effect the convergence of the Markov chain, a statement which is supported by our 
experiments. 


• Tuning the prior distribution with e 1 leads to a set of weights {v U)fi (U), U G U n } 
with high discrepancy. Therefore, when the marginal chain {U k , k G N} reaches 
stationarity, the subset samples yield similar summary statistics and makes the 
distributions 7r( ■ | Uk) and 7r( • | U k+ \) even closer. In addition, this makes refreshing 
U an unlikely event and a transition is therefore most of the time valid. 

Although, we do not carry out further theoretical analysis in this paper and leave the 
proof of stability of the marginal Markov chain {9k, k G N} as the central question of 
a forthcoming paper, we outline two considerations that are likely to help proving the 
ergodicity of the LWA-MCMC Markov chain. 


Considering 7r nj£ as an intermediate target distribution 

The ergodicity of the data-augmented Markov chain relies on the ability of the 
sampler to absorb minor target changes. To the best of our knowledge and proba¬ 
bly as a result of a rather unusual MCMC development, there has been very little 
literature on the convergence of Markov chain toward time-evolving target distribu¬ 
tions. However, we connect this question to the one addressed in Kuhn and Lavielle 


(2004) about the stability of a Markov chain targeting a sequence of distributions, 


each parameterized by a vector which is recursively updated through a Stochastic 
Approximation procedure (Robbins and Monro, 1951). 

- T4 - 










Considering 7r as the ultimate target distribution 

In another perspective, considering the marginal chain of interest {9 k , k £ Lf} and 
regarding {U kl k £ Lf} as an auxiliary sequence of parameters, the acceptance 
probability a(9, 6' \ U ) of the LWA-MCMC can be viewed as an noisy version of 
the M-H acceptance ratio, which is the acceptance probability involved in the M-H 
transition targeting the full-posterior n. This observation combined with the main 
result of Alquier et al. (2014) (see also Pillai and Smith, 2014) stating that a M-H 


transition kernel with a noisy acceptance probability nevertheless allows to obtain 
samples from a distribution whose total variation distance to n is bounded from 
above, provided that for all 64 £ 0, U k +i £ U and 6 Q(o k , •), 

E | a(9 k , 9' | U k+1 ) - a(9 k , 9') \ < 5{9 k , 9 '), 

where the expectation is taken under U k+ \ and 5 : © 2 —>■ M + is a deterministic 
function. 


5. Illustrations of LWA—MCMC 

We evaluate the efficiency of LWA-MCMC in two different applications: the first is 
posterior inference of a time series observed at N = 10' contiguous time steps. This 
example is particularly relevant since the observations are non i.i.d.: as a result, this 
makes LWA-MCMC the only competing algorithm against M-H to infer such a model 


- the other solutions ( 

Korattikara et ah, 2014; 

Bardenet et ah, 2014, 

Maclaurin and 

Adams, 2014, 

Banterle et ah, 2014 

) being only suitable for i.i.d. data. The second task is 


a Gaussian binary classification problem based on N = 10' data, which we use to compare 


LWA-MCMC with the M-H Sub Likelihood inference method proposed in Bardenet et al.| 
(2014). Finally, we provide some additional details to the handwritten digit example 


outlined in Section [2] 

5.1. Inference of an ARM A model 

An ARMA(1,1) time series {Y k , k < N} is defined recursively by: 


Y 0 ~ n, Z 0 ~ jV (0, a 2 ) 

Y k = aTfc-i + /3Zfc_i + 7 + Z k , 


yY(0, a 2 ), V k > 1 , 


( 20 ) 


where p is some distribution on (Y,t^) and 9 = £ M 3 . The likelihood of a 

trajectory can be written 


N 

f(Y 0:N | 9) = n{Y 0 ) J] g k (Y k | Y 0:k _ u 9 ), 
k =i 


( 21 ) 


such that for all k > 1 , 

g(Y k | y 0 :fc-i, 0) = T(W; aY k _ 1 + 0/i fc (y 1:fc _!) + 7 , ^ 2 ) (22) 

where we have dehned x —> T(a:;m, v) as the pdf of the univariate Gaussian distribution 
with mean m and variance v and {h k , k > 1} is a set of known deterministic mappings. 
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The purpose of this example is to infer the posterior distribution i t(6 | Wiv) using 
LWA-MCMC. We sampled a time series {Y k , k < N} according to (20), with N = 10', 


/i = A/"(0,1) and using 6* = (0.5,0.7, 0.1). The prior distribution on 6 is deliberately 
non-informative and taken as a Gaussian with mean (0, 0, 0) and a diagonal covariance 
matrix with a large variance. We restrict the subset parameter to the the set U n C U„ 
involving n contiguous observations: 

Un {Tp-n—1, ll : ni ■ ■ ■ ■ Y]\f_ n+l:jv} • 


Using such a sub-window yields a tractable likelihood (21) as otherwise, the function h k 


in (22) is no longer explicit. We compare the efficiency of LWA-MCMC with respect 
to M-H in function of n, e and S. In all this section, the simulations were achieved 
with a gaussian proposal kernel Q with variance tuned so that the acceptance rate of 
the sequence {6% k E N} is between 30% and 40%. For LWA-MCMC, a subset U E U„ 
is identified with its starting index and the proposal kernel R consists in assigning a 
probability on the discrete alphabet {1,..., N n + 1} standing for all the possible starting 
time for subsets U E U n . More precisely, R can be written as 


R(n 0 ; n' Q ) oc u exp (—A|n 0 — n' 0 \) + (1 — oj) exp (—Ain' — n' 0 \), 

■ri ~ N -n+l}). (23) 

The rationale is to propose U' ~ R{U, •) through a mixture of two distributions: the 
first gives higher weight to local moves and the latter allows jumps to remote sections 
of the time series (through the offset n r ). This mixture allows to browse through 0„, 
while avoiding to remain trapped in local optimal subsets. In this example, we have used 
u = 0.9 and A = 0.1. 

Figure [4] displays the sample path of three different Markov chains (blue, black and 
red) simulated during a fixed time budget (1 hour). The blue dashed sample path is 
the standard M-H targeting the full-posterior 7r and the other two correspond to the 
LWA-MCMC transition kernel for n = 10, 000 and n — 1, 000, with e = 1. In this setup, 
the summary statistics vector was defined as S = Sq with 


VL e0 n S 0 (Yu) = (q. 2 (Y U ),q. 5 (Y u ),q.s(Y u ),p 1 (Y u ) i ...,p 5 (Y u )) , (24) 

where for any A E (0,1), q\{Yu) is Y v A-quantiles and for all p E N, p p (Yu) is the p- 
lag sample autocorrelation. As expected, the time to reach stationarity is significantly 
reduced when sampling through LWA-MCMC. Actually, the M-H sampler does not even 
reach the high density area of the support during this time frame. On the other hand, 
when reducing n the time to reach convergence is even smaller but the stability of the chain 
may be affected. This is illustrated with Figure [5] which shows the posterior distribution of 
the three parameters 7r* e for e = 1, S — S 0 , n = 10, 000, n — 1, 000 and n = 100 obtained 
through a time-normalized LWA-MCMC simulation. For the setup featuring n = 100, 
the efficiency of the sampler could be criticized, since a slight bias results. Nonetheless, 
in an attempt to correct this behaviour, the parameter e is decreased, making the choice 
of subset more critical. Here, Figure [6] shows that it is worthwhile to penalize subsets 
which are less relevant with respect to the summary statistics Sq. More precisely, since 
we are interested in the posterior distribution of the parameter we have reported the 
mean density observed along the confidence intervals corresponding to the .2-quantiles 
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Figure 4: ARMA example - M-H (dashed, blue) and LWA-MCMC (dashed, red and black) (e = 1, 
S = Sq) for a fixed computational budget. 


and .8-quantiles. Indeed, because each run is a stochastic process (a random collection 
of subsets is sampled) there is variability within the collection of posterior distributions 
provided by each run (especially when eCl). These results were obtained through 100 
independent runs of LWA-MCMC for n = 100, S — So and e G {1,10 -2 ,10 -4 }, each run 
featuring 200, 000 iterations among which the 10, 000 first were discarded for burn-in. 
The initial states of all chains were drawn from the prior distribution. Moreover, we have 
included two extra setups which can be regarded as "extreme" scenarios: 

• free subset - a standard M-H kernel targeting a sub-posterior which refreshes 
uniformly the n = 100 data of the subset at each transition, 

• fixed subset - a standard M-H kernel targeting a sub-posterior with a fixed subset 
of n — 100 data chosen uniformly before the first iteration. 

While the first one can be seen as a LWA-MCMC kernel with e —> oo since the relevance 
of the subset does not interfere the parameter sampling, the latter is related with a LWA- 
MCMC kernel which would be trap in a single subset throughout all the sampling scheme, 
hence e —> 0. The vertical green and blue lines respectively indicate the true parameter 
value and the expectation of the mean sub-posterior distribution targeted by the different 
MCMC. Clearly for n = 100 and S = So, the parameter e = 10~ 2 yields an optimal setup 
since for all the three parameters, the expectation of the posterior distributions meets the 
true value. Beyond this optimal e, the weights \y n ^(U), U G U n } become too unbalanced. 
As a result some relevant subsets are simply ignored and the inference is performed on 
a restricted set of subset. Table [ 2 ] shows indeed that for e < 10~ 2 , the subset may be 
refreshed relatively infrequently. 

Finally, we consider for the setup n = 100 and e = 10~ 2 , two other possible choices 
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setup / e 

free subset 

1 

10" 1 

10~ 2 

10" 3 

o 

1 

fixed subset 

U refresh rate 

1 

.81 

.34 

.05 

.001 

10“ 4 

0 


Table 2: ARMA example - Refresh probability of the subset U when targeting n n e with n = 100 and 
S = S 0 . 



tk 



tk K 


LWA-MCMC, n=100, e=l, S 0 




A 


" T ' 'CM " 


qp i 

_ A _._._ 

B k Ik 


Figure 5: ARMA example - Boxplot of 100 independent LWA-MCMC sample paths for different n - 
green lines represent the true parameter 9*. 


for S: 


Si(Yu) = (pi(Yu),..., pi 5 (Y lT )) , S 2 (Yu) = (min(y C7 ), max(y [/ )) . 

Figure [7] shows that this naive choice of S 2 as summary statistics yields a poor sub¬ 
posterior target. Indeed, the minimum and the maximum of the time series does not 
characterize much the model parameters. Note that this setup gives worse results than 
the free subset setup: while the latter does not advantage any subset U G U n , the LWA- 
MCMC setup with n = 100, e = 10 -2 , S = S 2 uses more likely subsets which match 
the min/max statistics of the full dataset and which, as a result, may force the sampler 
to infer the parameters through those unrepresentative subsets. Indeed, the min/max 
stopping times are likely to be far from each other as they represent large deviations to 
the stationary process. Therefore the subsets whose min/max statistics of the n = 100 
contiguous time step are close to that of the full dataset can reasonably be labeled as 
"anomalies". On the other hand, when the inference is performed using S = Si, the 
collection of subsets is only guided through the correlation between consecutive states. 
We see that the results are better when these relative statistics are coupled with global 
statistics such as the mean and the two quantiles q, 2 and q.g, like in S = Sq. 


5.2. Binary classification 

In this application, we consider the binary classification example used in the MCMC 
SubLikelihood (MCMCSubLhd) approach (Bardenet et ah, 2014). MHSubLhd is an 
alternative to M-H, designed to perform Bayesian inference in large datasets contexts; 
see (|2]) and the corresponding introduction section for more details. We consider here 
the example in Section 4.2 of Bardenet et al. (2014) in order to compare the efficiency 
of LWA-MCMC and MHSubLhd. A large number (.N = 10') of realizations from a 
two-dimensional Gaussian mixture distribution with two classes were sampled with the 
following parameters 

Pi = [-1, 0], p 2 = [1, 0], Si = S(cr 2 = .25), S 2 = S(a 2 = .25), E(cr) = cliag(a 2 ; a 2 /2), 
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Figure 6: ARMA example - Mean density and confident intervals for each parameter, obtained through 
100 independent LWA-MCMC runs and for different e. Vertical blue lines give the mean of the mean 
density and vertical green lines the true parameter 8*. 
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Figure 7: ARMA example - Mean density and confident intervals for each parameter, obtained through 
100 independent LWA-MCMC runs and for different summary statistics S. Vertical blue lines give the 
mean of the mean density and vertical green lines the true parameter 6*. 



Figure 8: Binary Classification example - Samples from the two-dimensional mixture of two Gaussian 
distributions. 


for the mean and covariance matrix of the two components respectively; see Figure [8] 
Both LWA-MCMC and MHSubLhd provide a sequence of parameters {0j,k — a j,k), 
j G {1,2}, k e N} which is used to classify test-data {Y m , m < 10 7 }, sampled from 
the same model, through the following real-time maximum likelihood classifier defined at 
time t > 0 by: 


C t (Y m ) 


arg max f(Y u 
ie{ 1 , 2 } 


where 


Vf e M, n{t) = max fceN {f > r k } , 

Tk is the time at the end of the k -th iteration . 


On the basis of the time series simulation example, LWA-MCMC was tuned with 
n = 1,000, e = .01 and nS(U ) = 1{4 =0 }; l{ 4 =i}] • The specific MHSubLhd 

parameters were chosen accordingly: 

• 6 = .1, which states that the decision to accept/reject a proposed candidate in 
MHSubLhd is in accordance with the M-H decision with probability 1 — 5 — .9 


every time a decision (accept/reject a candidate) is postponed, rq = 1000£ new 
data are added to the current subset, where £ G N* is the number of times that a 
decision has been postponed in the current MCMC transition (the size of the subset 


increments, following the guidelines provided in the Section 2.2 of Bardenet et al. 


(2014); 
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scenario 

1 

2 

3 

4 

LWA-MCMC 

1,000 

1,000 

1,000 

1,000 

MHSubLhd 

2.4 10 6 

1.9 10 6 

2.35 10 6 

2.8 10 6 


Table 3: Binary Classification example - Average number of data used per transition. 


Finally, the same proposal kernel were used for both samplers: 


j ~ Unif(l, 2), (ei, e 2 ) ~ M (0,1), 


/') = /'., + \i< i 
o] = exp(^e 2 )cr j 


where the parameter {y,-, Qj}j<={ 1 , 2 } were updated to maintain an acceptance rate between 
.25 and .35 through an adaptive Metropolis procedure (Haario et ah, 2001). 


Figure [9] shows four independent comparisons between LWA-MCMC and MHSubLhd. 
Both samplers starts with the same initial state (drawn from the prior) in all scenarios. 
The first column shows the sample path of the two Markov chains against the time (in 
second): plain lines are LWA-MCMC sample paths and dashed lines are MHSubLhd 
sample paths. We stress that the step shape of the MHSubLhd sample paths does not 
highlights a poor mixing chain but illustrates the fact that a single MHSubLhd transition 
can take a similar amount of time as a standard M-H transition. As a consequence, the 
chain remains at the same state for a large amount of time. Table [3] shows indeed that, 
on average, the MHSubLhd sampler ends up using almost 25% of the full dataset at each 
transition. The results clearly show that, in this application, using LWA-MCMC instead 
of MHSubLhd results in a practically useful approach, with some spectacular convergence 
acceleration: in scenario 4, LWA-MCMC is 22 times faster than MHSubLhd. 


5.3. Additional details for the handwritten digit example of Section^ 

In the handwritten digit example (see Section [2j, we have used batches of n = 100 
data. The summary statistics were simply defined so as to promote subsets which have 20 
observations from each class. A very low bandwidth e = 10 -5 was used in order to enforce 
this characteristic. Similarly to the binary classification example, the proposal kernels of 
LWA-MCMC and M-H were defined in the same style, namely a Random Walk kernel 
where at each iteration only a bloc of the template parameter of one of the 5 classes is 
updated (6 e R 256 in this example). The variance parameter of the Random Walk is 
here again adapted according to the past trajectory of the chain, so as to maintain an 
acceptance rate of .25. 

In contrast to the previous examples, the computational difference is not explained 
by the fact that the M-H acceptance probability is more expensive to compute. Indeed, 
both need to evaluate the function 0(d)) : R 256 —>• R 225 for the updated class j, which 
represents the only heavy routine calculation. In fact, the adjusted variance of the M-H 
proposal turns out to be 10 times lower than that for LWA-MCMC. Loosely speaking, on 
the one hand, the M-H proposal needs to provide a parameter 6' which fits in about 2, 000 
observations: the log of the acceptance ratio depends of Ylk =1 l 4 =j{||W — 0(d))|| 2 — || 10 — 
0(dj)|| 2 } for the updated class j. On the other hand, the LWA-MCMC proposal should 
match only about 20 images through the quadratic term lj fc =j{||L0 — 0(d))|| 2 — 

1110 — 0(dj)|| 2 }. As a consequence, the M-H adapted variance makes the Random Walk 
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Figure 9: Binary Classification example - Four independent scenarios - left column: sample paths of the 
two samplers (LWA-MCMC in plain and MHSubLhd dashed) ; right column: classification error rate on 
the training dataset (LWA-MCMC in plain and MHSubLhd dashed). 
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less efficient, which results in the Markov chain { 9 *., k G N} exploring the state space 0 
slower. 

6. Conclusion 

The Light and Widely Applicable MCMC methodology introduced and discussed in 
this paper is attractive as it overcomes the critical issues encountered by the popular 
Metropolis-Hastings sampler in the modern development of big data inference problems. 
Several recent noisy M-H methods have been proposed to address these issues. However, 
(i) they are only valid for hid.realizations and (ii) they may use a significant portion 
of the dataset negating any potential computational saving. LWA-MCMC pushes the 
approximation one step forward to preserve the celebrated M-H simplicity. The efficiency 
of the sampler is illustrated in several typical Bayesian problems including parameter 
estimation and classification. 

Our experiments have illustrated the usefulness of LWA-MCMC. Future work should 
extend the theory beyond the scope of exponential models. 
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Appendix A. Proof of Proposition [l] 

Proof. Without loss of generality, we take g as the identity on 0. By straightforward 
algebra, 


KL MIIM = E, { log |Aj = / K (0), S(Y k ) - g S(Y„) ) + (n - JV)E» M(9)} , 

Z(Yv) 


N 


log 


Z(Y 1:N ) 


, (A.l) 


where £(0) = log L(9). Now, let 


N 


feet/ fc=i 


(A.2) 


and express (A.l) as: 


KL MUM = (l - -|) ( E. (9), J2 S(Y t ) j+(n- N)E„ {<(«)} - n <&, E„(9)} , 

Z(M 


fc=l 


log 


We now want to find an upper bound of log Z(Yu)/Z(Y 1:JV ). First, note that 


(A.3) 


Z{Y V ) = / p(dfl)- 


exp (n/N) (0,Y^k=i S (Y k ) 


L(ey 


exp (n (0,£u)) 


N 


= j p(d0) exp (n(0,&)) 

= f pm n f(Y„ | 9) exp (n (9, &» jj] f(Y k 19) 
Because n/N G (0,1), we can write 

' N 

II/(n|9)f <a(UJ»-‘ 


k =1 


1 


f-1 


(A-4) 


— — 1 

TV - 1 - 




where a(Fi:n) = infuse /(^fc I $)• As a consequence, we have: 

^(Vi :JV ) 


<a(Yi : „)Ar / 7r(d6» | Y 1:N ) exp (n (6, &)) 


= a(Y 1:n ) N 1 E^{exp(n(6»,^))} (A.5) 
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Plugging (A.5) into (A.3) yields 


KL MM < (l - | (9 ), S{Y k )\ - log a(Y 1:N ) J + (n - N)E n {£(9 )} 


y(n,N,Y 1:N ) 

+ log E,,. {exp {n (9, £[/))} — n (&/, E n (9)) . (A.6) 


v- 

n(U,n,Y 1:N ) 


Now, using the Cauchy-Schwartz inequality we have: 

Q.(U,n,Y 1:N ) = logE^ [exp {n {9,^u) - n (&, E 7r (6»))}] 

= logEjr [exp{n(6> - E 7r (6>), ^)}] < logE^ {exp (n||0 - E w (0)|| • Uu\\)} ■ 

'o-----«" 

B(U,n,Yi,N) 

Finally we have 

KL (7r||7Tj/) < ^(n, N,Yi : n) + B(U, n, Li : jv), where B(U,n,Yi : N ) > 0 

such that the subset U G U„ minimizing ||^c/|| also minimize B(U,n, LW) with the special 

case 

U e \j* n ^ B(U,n,Y 1:N ) = 0, 

hence completing the proof. □ 
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